Transcription is one of the essential processes by which cells read the genetic information encoded in genes,
and it is initiated by the binding of RNA polymerase to the corresponding promoter. Experiments have found
that the nucleotide sequence of a promoter has a great influence on gene expression strength, or promoter
activity. In synthetic biology, one interesting question is how to synthesize a promoter with a
given activity, and which positions of the promoter sequence are important for determining its activity.

In this study, based on recent experimental data, correlations between promoter activity and its
sequence positions are analyzed by various methods. Our results show that, besides the nucleotides in
the two highly conserved regions, the -35 box and the -10 box, the influence of nucleotides at other positions
is not negligible. For example, modifications of nucleotides around position -19 in the spacing
region may change promoter activity over a large range. The results of this study might be helpful for
understanding the biophysical mechanism of gene transcription, and for the
design of synthetic cell factories.


I. INTRODUCTION 

In cells, genetic information is transcribed from the DNA template to messenger RNAs (mRNAs) by RNA polymerase
(RNAP) through a series of complex processes, called transcription. The key step that starts transcription is the
binding of RNAP to a special nucleotide sequence in DNA, which usually lies upstream of the transcription start
site of the gene and is called the promoter [1-4]. Experimental data show that the expression strength of the corresponding
gene, or protein production rate, is greatly influenced by the nucleotide sequence of the promoter [5-7]. Therefore, it is
important for synthetic biology and genetic engineering to choose an appropriate promoter to achieve the needed expression
strength. Meanwhile, experiments also find that the activity of a promoter does not change with the gene it expresses
[8-11]. This means that a promoter with strong activity in the expression of one gene can also express other genes at
relatively high strength. Therefore, it is biologically meaningful to establish promoter libraries whose promoter strength
(or activity) varies over a large range, as has been done experimentally in [7, 8, 12-16]. Meanwhile, in order to
understand the regulation mechanism of promoters in gene expression, various nucleotide sequence dependent models
have also been designed, which are expected to be applicable in real cells [17-21].

For a constitutive promoter, i.e. one whose activity is not influenced by transcription factors, the initiation rate of
transcription is mainly determined by the RNAP binding rate to the corresponding promoter. As in almost all previous theoretical
studies of promoter activity, this study assumes that the RNAP binding rate depends only on the nucleotide
sequence of the promoter. It has been found that, in E. coli, promoters include two highly conserved hexamers,
usually called the -10 box and the -35 box, which are essential to promoter activity [2, 5, 6]. In some theoretical
models, only nucleotides in these two regions, as well as the discriminator region and transcription start region, are
considered in detail, while for the spacing region between the -10 and -35 boxes only its length is assumed to
contribute to promoter activity, through a penalty term in the model [18]. However, recent experimental data presented in [16]
show that promoter activity also changes with the nucleotide types in the spacing region.

Although nucleotides in the spacing region are included explicitly in some theoretical studies, correlations
between sequence positions in the spacing region and promoter activity are not discussed there [6, 7, 17, 19, 20]. In this study,
based on the promoter libraries given in [9, 16], these correlations are calculated by various methods. Here, a large
correlation means that promoter activity changes greatly with the nucleotide type at this position, while a small
correlation means that promoter activity is insensitive to the nucleotide type at this position. Our results show that,
besides sequence positions around the -10 box and -35 box, nucleotides at positions -20, -19, and -18, which lie in
the spacing region, play important roles in determining promoter activity. In contrast, nucleotides at positions
-23 and -15 seem to be of no significance.


II. RESULTS 

The importance of the nucleotide hexamers in the -35 box and -10 box of a promoter has been discussed previously
[5, 23, 24]. The main aim of this study is to find which positions in the spacing region are important for promoter
activity, in the sense that modifications of nucleotides at these positions may change expression strength greatly. To
achieve this, a linear model is designed to describe the relationship between nucleotide sequence and promoter activity.
The model is based on basic principles of statistical physics and the assumption that the strength of gene expression is
proportional to the RNAP binding rate to the promoter (see Sec. III).

For the data from Wang's study group [16], promoter sequences differ only in the spacing region, i.e. from
position -29 to -13. So in our analysis of these data, only nucleotides in the spacing region are considered. The data obtained by
Mutalik et al. in [9] consist of three groups, which are denoted by mpl, rpl, and pilot. In all three
groups, the nucleotide hexamers in the -35 box and -10 box are not fixed. Therefore, the nucleotide sequence from position
-35 to -1 is considered in our analysis. For each promoter, the length of its spacing region is 17.

The relationship between nucleotide sequence and promoter activity is described by three k-mer models with
k = 1, 2, 3. Here the k-mer model assumes that promoter activity is determined by all groups of k adjacent nucleotides,
and at each sequence position i there are altogether $5^k$ variables, one for each ordered arrangement of k symbols drawn from A,
T, G, C, and -, where - means that the corresponding nucleotide is missing. Due to the large number of variables,
partial least squares (PLS) regression is used in the calculations. To improve model accuracy and avoid overfitting, 10-fold cross-validation is
used in the PLS regression (see Sec. III).

For each group of experimental promoter strength data obtained in [9, 16], the model coefficients related to
each sequence position are obtained by PLS regression with 10-fold cross-validation. We calculate their variance
and range (defined as the difference between their maximum and minimum), which are regarded as two criteria for the
influence of the corresponding sequence position on promoter activity. The main difference between variance and
range is that the variance is the average squared deviation of the model coefficients from their mean, while the range
only describes the interval over which they vary.
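As a minimal sketch of these two criteria, the snippet below computes the variance and range of the coefficients attached to one position. It assumes the population variance (mean squared deviation), which matches the verbal definition above but is not stated explicitly in the text; the function name and the coefficient values are hypothetical.

```python
from statistics import pvariance

def position_spread(coefficients):
    """Return (variance, range) of the model coefficients tied to one position."""
    # Population variance: mean squared deviation from the mean (assumed convention).
    variance = pvariance(coefficients)
    value_range = max(coefficients) - min(coefficients)
    return variance, value_range

# Hypothetical 1-mer coefficients for the symbols (-, A, T, G, C) at two positions.
sensitive = [-1.0, 0.5, 1.0, -0.5, 0.0]    # activity depends on the nucleotide
insensitive = [2.0, 2.0, 2.0, 2.0, 2.0]    # large but identical coefficients

print(position_spread(sensitive))
print(position_spread(insensitive))
```

The second position illustrates why the size of the coefficients alone is not used: identical coefficients give zero variance and zero range, so the nucleotide type at that position does not change the predicted activity.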

The main aim of this study is to analyze the correlation between sequence positions, especially those in the spacing
region, and promoter activity by using all four data groups in [9, 16]. However, the scales of the variances and ranges
obtained from the four data groups with the three versions of the k-mer model are different. If these values were combined
directly, the information in small-scale values would be swamped by that in large-scale values, and their
influence would be weakened inappropriately. To avoid this, we turn all variances and ranges into so-called scores by
the following method. We sort positions from 1 to 35 (17 for Wang's data) by their variances (or ranges) in descending
order. The position ranked n-th is scored 36 - n (18 - n for Wang's data). Thus, for each data group, based
on the variance and range of the three k-mer models, six scores are obtained. For convenience, the scores obtained
from the variance and range of the k-mer model are denoted by $V_k$ and $R_k$ respectively. In addition, another score
based on the F-statistic is obtained from the following idea. If promoter activity depends greatly on a position, then
the precision of the corresponding model with this position neglected will be low, so its 10-fold cross-validation error will
be large. Using this 10-fold cross-validation error and the same ranking method as above, one new score is
obtained for each sequence position, which is denoted by F for simplicity. Note that, in the calculation of score F, the
10-fold cross-validation errors are obtained from the 1-mer model.
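The ranking rule can be sketched as follows. The function name and the example values are hypothetical, and ties are broken by input order, which is an assumption since the text does not say how ties are handled.

```python
def rank_scores(values):
    """Score positions by descending rank: the n-th largest value gets N + 1 - n."""
    n_positions = len(values)
    # Positions ordered from the largest to the smallest criterion value.
    order = sorted(values, key=values.get, reverse=True)
    return {pos: n_positions + 1 - rank for rank, pos in enumerate(order, start=1)}

# Hypothetical variances at three positions.
example = {-20: 0.9, -19: 0.4, -18: 0.7}
print(rank_scores(example))
```

With N = 35 positions this reproduces the rule 36 - n, and with N = 17 it reproduces 18 - n.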

For each group of data, we have altogether seven scores. For Wang's data, the score of a position in the spacing region
ranges from 1 to 17, while for the other three groups of data the score ranges from 1 to 35. To place all scores on the same
scale, a linear transformation is applied to each of the 28 scores such that, for each score, its minimal and maximal
values over the sequence positions in the spacing region are 0 and 100 respectively. See Tables SI-SIV in [22] for
the processed data.
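A minimal sketch of this linear transformation is given below, assuming a plain min-max rescaling whose minimum and maximum are taken over the spacing region only. The function name and the example scores are hypothetical.

```python
def rescale_to_spacer(scores, spacer_positions):
    """Linearly map scores so the spacer minimum is 0 and the spacer maximum is 100."""
    spacer_values = [scores[p] for p in spacer_positions]
    low, high = min(spacer_values), max(spacer_values)
    # The same linear map is applied to every position, inside or outside the spacer.
    return {p: 100.0 * (s - low) / (high - low) for p, s in scores.items()}

# Hypothetical raw scores: position -30 belongs to the -35 box, the rest to the spacer.
raw = {-30: 35, -29: 20, -20: 10, -13: 5}
print(rescale_to_spacer(raw, spacer_positions=[-29, -20, -13]))
```

Under this assumption, positions outside the spacing region can receive values above 100, which is consistent with the -35 and -10 boxes scoring higher than any spacer position.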

Due to the differences in strains and measurement methods used in different experiments [9, 16], the correlations
between sequence position and promoter activity obtained from different data may differ. The seven scores as
well as their average obtained from each data group are plotted in Figs. 1(a-d); see also Tables SI-SIV for
the score details [22]. The x-axis shows the promoter sequence position, and the y-axis shows the corresponding score.
To distinguish the scores and their average, different markers and colors are used in Figs. 1(a-d), which are listed
at the bottom of Fig. 1. The average scores obtained from data groups mpl, rpl, and pilot measured in [9] (see
the thick black lines in Figs. 1(b-d)) show that the -35 box and -10 box are the most important regions, while
the spacing region (the region between the -35 and -10 boxes) is the least important region, which is consistent with
previous experimental observations. The scores of spacing region positions near these two boxes are also high. For
data groups Wang, mpl, and rpl, positions around -19 also have high scores. However, according to data Wang,
positions around -26 get higher scores than those around position -19. For data pilot, the average score oscillates in
the spacing region and reaches its maximum at position -17. The average scores listed in Table SV (see [22]) show
that, for data Wang, the four most important positions in the spacing region are -26, -25, -21, -20, and the four least
important positions are -15, -14, -13, -23. For data mpl, the most important position in the spacing region is -20,
and the least important position is -27. For data rpl, the most important positions in the spacing region are -19, -18,
while the least important positions are -24, -25, -26, -23. Finally, for data pilot, the most important positions in
the spacing region are -16, -29, and the least important positions are -14, -15 and -21.

In the following, we use three methods to combine all 28 scores and find the most and least relevant positions
in the spacing region, which are expected to hold generally for any E. coli strain. One straightforward method is to
calculate the weighted average of the four average scores obtained from the four data groups, with sample numbers as
the weights. The sample numbers for data Wang, mpl, rpl, and pilot are 35, 69, 113, and 154 respectively. Since the
sequence length of promoters in Wang's data is different from those in data mpl, rpl, and pilot, we average
the four average scores only in the spacing region, i.e. from position -29 to position -13. In other regions, the overall average
is obtained from the three data groups mpl, rpl, and pilot. All 28 scores and their average value are plotted in
Fig. 2(c). These plots show that, generally, the -10 and -35 boxes are more important for determining promoter
activity, while the spacing region is the least relevant region. From the average scores listed in Table SV (see [22]), we
find that, besides the two positions -29 and -13, which are adjacent to the -35 or -10 boxes, the most important positions
in the spacing region are -20, -16 and -18, while the least relevant positions are -23 and -15. For convenience, this
method of calculating the average score is called the weighted score method (WSM).
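The weighted average can be sketched as below. The sample numbers are those quoted in the text; the function name and the example scores are hypothetical. A data group contributes to a position only if it covers that position, which mirrors the treatment of Wang's data outside the spacing region.

```python
SAMPLE_NUMBERS = {"Wang": 35, "mpl": 69, "rpl": 113, "pilot": 154}

def weighted_score(avg_scores, position):
    """Sample-number-weighted mean over the data groups that cover this position."""
    numerator = denominator = 0.0
    for group, scores in avg_scores.items():
        if position in scores:              # Wang's data only cover -29..-13
            weight = SAMPLE_NUMBERS[group]
            numerator += weight * scores[position]
            denominator += weight
    return numerator / denominator

# Hypothetical average scores of the four data groups at two positions.
avg_scores = {
    "Wang": {-19: 80.0},
    "mpl": {-19: 60.0, -10: 100.0},
    "rpl": {-19: 90.0, -10: 90.0},
    "pilot": {-19: 20.0, -10: 95.0},
}
print(weighted_score(avg_scores, -19))   # all four groups contribute
print(weighted_score(avg_scores, -10))   # only mpl, rpl, and pilot contribute
```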

The second method to find the correlation between sequence positions in the spacing region and promoter activity is called the
four partitions method (FPM), because for each data group the sequence positions in the spacing region are divided into
four relevance groups: the important group, sub-important group, sub-unimportant group, and unimportant group. In this
method, sequence positions in the spacing region are first sorted in descending order according to their average scores.
Then the first five positions are assigned to the important group, and the remaining 12 positions are assigned to the other
three groups in turn, with four positions in each group; see Table SVII [22]. Then, for each sequence position, we
count the number of times it lies in each of the four relevance groups; see columns 2 to 5 in Table SIX [22]. According to this
method, position -29, which is adjacent to the -35 box, is the most important position in the spacing region because it
is assigned to the important group three times and to the sub-important group once. The important positions
in the spacing region which are not adjacent to the -10 or -35 boxes are -20 and -19, and the most unimportant positions
are -23 and -15.
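The partitioning and counting steps can be sketched as follows. Function names and the example scores are hypothetical, and the sketch assumes exactly 17 spacer positions per data group, split into groups of 5, 4, 4, and 4 as described.

```python
from collections import Counter

GROUP_NAMES = ["important", "sub-important", "sub-unimportant", "unimportant"]
GROUP_SIZES = [5, 4, 4, 4]   # 17 spacer positions in total

def four_partitions(avg_scores):
    """Split spacer positions into four groups by descending average score."""
    order = sorted(avg_scores, key=avg_scores.get, reverse=True)
    groups, start = {}, 0
    for name, size in zip(GROUP_NAMES, GROUP_SIZES):
        groups[name] = order[start:start + size]
        start += size
    return groups

def count_memberships(per_group_scores):
    """Count, for each position, how often it falls in each relevance group."""
    counts = {}
    for scores in per_group_scores:
        for name, members in four_partitions(scores).items():
            for pos in members:
                counts.setdefault(pos, Counter())[name] += 1
    return counts

# Hypothetical scores: positions closer to the -35 box score higher.
toy = {pos: float(-pos) for pos in range(-29, -12)}
print(four_partitions(toy)["important"])
print(count_memberships([toy, toy])[-29])
```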

The third method to discuss the correlation between sequence position and promoter activity is called the signed-rank
and rank-sum method (SRRSM). In this method, according to the seven scores from each data group, the sequence positions
are divided into three groups by the Wilcoxon signed-rank test and the Wilcoxon rank-sum test (see Sec. III); see Table SVIII
in [22]. According to these tests, the score of any position in the first group is larger than that of any position in the
third group at significance level $\alpha = 0.05$. For any position in the spacing region, the number of times that it lies in
a given group is listed in Table SX [22], which shows that the most important positions are -29, -20, -19, -18, and
-16, while the most unimportant position is -23.

The reason that we use seven different ways to calculate the position score for each data group is that, generally, the
orderings of sequence positions obtained by different scoring methods are not the same, and we cannot know which one
of them is more reasonable. In the previous discussion, we simply used their average to sort the sequence positions in the spacing
region of the promoter. Another way to deal with this problem is to use a data clustering method to exclude the scores
that differ strongly from the others. In other words, we keep and average only the scores that give similar
orderings, and exclude those whose orderings are peculiar (see Sec. III). Since we are mainly
interested in the relevance of sequence positions in the spacing region, only the scores of the spacing
region positions are used in the data clustering process. The results of clustering are plotted in Figs. S1(a-d) [22]. In this study, scores with
clustering distance larger than 0.55 are excluded. For data Wang and mpl, no score is excluded. For data rpl,
only the score F is excluded, while for data pilot only the scores $V_1$, $V_2$, $V_3$, and $R_1$ are kept.

After data clustering with distance criterion 0.55, there is no change for data Wang and mpl. The new average
scores of data rpl and pilot, obtained by averaging only the surviving scores (scores with distance less than 0.55), are
plotted in Fig. 2(a) and Fig. 2(b) respectively. See also Table SVI in [22] for their detailed values. Fig. 2(a) shows
that, for data rpl, besides the positions adjacent to the -10 or -35 boxes, position -19 is the most important one, while
position -26 is the most unimportant one. For data pilot, however, the most important position is -16, and the most
unimportant positions are -13, -14, -15. All 23 surviving scores and their weighted average, with sample numbers
as weights, are plotted in Fig. 2(d). The sorted sequence positions in the spacing region obtained by the data clustering
method are listed in Table SVI [22]. This clustering method gives that, besides the positions adjacent to the -10 or -35
boxes, the most important positions in the spacing region are -16 and -18, and the most unimportant are -23 and -15. This
method is denoted by CWSM.

The four partitions method can also be modified by using the data clustering process; the modified method is denoted by CFPM.
The new four partitions of the four data groups are listed in Table SVII (see [22]), and the number of times that each
sequence position lies in a given partition is summarized in Table SIX [22], which shows that positions -29, -19, and
-20 are important while positions -23 and -15 are unimportant. Similarly, the results of the data clustering version
of SRRSM (denoted by CSRRSM) are listed in Table SVIII and Table SX [22]. The results for data group
pilot are not presented, since there are only four surviving scores, which are not enough to get reliable results from
the Wilcoxon signed-rank test or the Wilcoxon rank-sum test. The results in Table SX show that positions -29, -18, -19,
and -20 are more important, while position -23 seems to be unimportant [22].

Finally, the importance of sequence positions in the spacing region obtained from the six methods is summarized in
Table I [22]. It can be seen that, in the region between positions -27 and -15, positions -20, -19, and -18 play
special roles in promoter activity. In contrast, positions -23 and -15 seem to be of no significance. Here we
neglect positions -29, -28, -14, and -13 because they are too close to the -10 or -35 boxes.


III. METHODS 

A. The k-mer models

Let y be the gene expression strength. According to basic principles of statistical physics and the assumption that the
strength of gene expression is proportional to the attachment rate of RNA polymerase (RNAP) to the upstream promoter,
we have

$$y = K \exp\left[-\Delta G/(k_B T)\right], \tag{1}$$

where $\Delta G$ is the energy barrier of RNAP attachment to the promoter, $k_B$ is the Boltzmann constant, and T is the absolute
temperature. In this study, T = 300 K is used as an approximation to the growth temperature of 37 °C (about 310 K). The constant K depends on all other experimental
conditions, such as the concentration of RNAP, the speed of transcription elongation and termination, as well as the
speed of the subsequent translation process. Therefore, the value of K will be different for data measured in different
experiments.
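As a worked illustration of Eq. (1), the ratio of two expression strengths measured under the same conditions depends only on the difference of their energy barriers, since the constant K cancels. The sketch below evaluates this at T = 300 K. Expressing the barrier per mole in kcal/mol (so that the gas constant replaces the Boltzmann constant) is an assumption made only for this example, as the text does not fix the units of the barrier; the function name is hypothetical.

```python
from math import exp, log

R_KCAL = 1.987204e-3   # gas constant in kcal/(mol K), i.e. the Boltzmann constant per mole
T = 300.0              # absolute temperature in kelvin, as used in the text

def fold_change(delta_delta_g):
    """Ratio y_new / y_old when the barrier changes by delta_delta_g (kcal/mol)."""
    # K cancels in the ratio, so only the barrier difference matters.
    return exp(-delta_delta_g / (R_KCAL * T))

thermal_energy = R_KCAL * T
print(thermal_energy)               # thermal energy in kcal/mol
print(thermal_energy * log(10))     # barrier decrease that raises activity tenfold
print(fold_change(-thermal_energy * log(10)))
```

Because activity depends exponentially on the barrier, a change of only a few units of thermal energy at a single position is enough to move promoter activity by an order of magnitude.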

We assume that the energy barrier $\Delta G$ is completely determined by the nucleotide sequence of the promoter
[21]. The simplest way to establish this relationship is to assume that each sequence position contributes to $\Delta G$
independently and additively, so that $\Delta G$ is given by the following linear combination,

$$\Delta G = \sum_{i \in D} \Delta G_{i, b_i}, \tag{2}$$

in which D is the set of sequence positions, and $b_i \in \{-, A, T, G, C\}$ is the nucleotide at position i ($b_i = -$ means the
nucleotide at position i is missing). For data group Wang [16],

$$D = \{\, i \mid i \in \mathbb{Z},\ -29 \le i \le -13 \,\}, \tag{3}$$

while for data groups mpl, rpl, and pilot [9],

$$D = \{\, i \mid i \in \mathbb{Z},\ -35 \le i \le -1 \,\}. \tag{4}$$

Eqs. (1,2) constitute the so-called 1-mer model.

If the expression of $\Delta G$ is replaced by

$$\Delta G = \sum_{i,\, i+1 \in D} \Delta G_{i, b_i, b_{i+1}}, \tag{5}$$

i.e. the total energy barrier is completely determined by all adjacent nucleotide groups of length 2, then we get the
2-mer model. Finally, if

$$\Delta G = \sum_{i,\, i+1,\, i+2 \in D} \Delta G_{i, b_i, b_{i+1}, b_{i+2}}, \tag{6}$$

then the corresponding model is called the 3-mer model.

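The 1-mer model (1,2) can be reformulated as follows

$$\log y = \log K + \sum_{i \in D} \sum_{b \in \{-, A, T, G, C\}} \delta_{i,b} \left[-\Delta G_{i,b}/(k_B T)\right], \tag{7}$$

where $\delta_{i,b}$, for $b \in \{-, A, T, G, C\}$, is defined as follows

$$\delta_{i,b} = \begin{cases} 1 & \text{if } b = b_i, \\ 0 & \text{else.} \end{cases} \tag{8}$$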


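For each promoter sample, its nucleotide sequence corresponds to a vector
$(\cdots, \delta_{i,-}, \delta_{i,A}, \delta_{i,T}, \delta_{i,G}, \delta_{i,C}, \cdots)$. Values of
$\log K$ and $-\Delta G_{i,b}/(k_B T)$ can be determined from measured data through partial least squares regression (PLSR).
Similarly, from the 2-mer model and 3-mer model, we obtain

$$\log y = \log K + \sum_{i,\, i+1 \in D} \ \sum_{b \in \{-, A, T, G, C\}} \ \sum_{b' \in \{-, A, T, G, C\}} \delta_{i,b,b'} \left[-\Delta G_{i,b,b'}/(k_B T)\right], \tag{9}$$

and

$$\log y = \log K + \sum_{i,\, i+1,\, i+2 \in D} \ \sum_{b \in \{-, A, T, G, C\}} \ \sum_{b' \in \{-, A, T, G, C\}} \ \sum_{b'' \in \{-, A, T, G, C\}} \delta_{i,b,b',b''} \left[-\Delta G_{i,b,b',b''}/(k_B T)\right], \tag{10}$$

respectively, where

$$\delta_{i,b,b'} = \begin{cases} 1 & \text{if } b = b_i \text{ and } b' = b_{i+1}, \\ 0 & \text{else,} \end{cases} \qquad \delta_{i,b,b',b''} = \begin{cases} 1 & \text{if } b = b_i,\ b' = b_{i+1}, \text{ and } b'' = b_{i+2}, \\ 0 & \text{else.} \end{cases} \tag{11}$$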
Note that the sample numbers of data groups Wang, mpl, rpl, and pilot are 35, 69, 113, and 154 respectively,
but the number of unknown variables in the above k-mer models may be very large. For example, the 1-mer model
for data group Wang includes $5 \times 17 = 85$ variables, and the 3-mer model for data groups mpl, rpl, and pilot
includes $5^3 \times (35 - 2) = 4125$ variables. This is why this study uses PLSR rather than ordinary least squares regression.
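The indicator variables and the variable counts quoted above can be sketched with a small encoder. The function name, the ordering of the features, and the example sequences are hypothetical choices made for illustration; only the alphabet and the counting follow the text.

```python
from itertools import product

ALPHABET = "-ATGC"   # '-' marks a missing nucleotide

def kmer_features(sequence, k):
    """One-hot encode every window of k adjacent symbols in the sequence."""
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]   # 5**k patterns
    index = {kmer: j for j, kmer in enumerate(kmers)}
    n_windows = len(sequence) - k + 1
    features = [0] * (n_windows * len(kmers))
    for i in range(n_windows):
        # Exactly one indicator per window is set to 1.
        features[i * len(kmers) + index[sequence[i:i + k]]] = 1
    return features

spacer = "A" * 17        # hypothetical 17-symbol spacing region (data group Wang)
promoter = "ATGC-" * 7   # hypothetical 35-symbol promoter (positions -35 to -1)
print(len(kmer_features(spacer, 1)))
print(len(kmer_features(promoter, 3)))
```

The two printed lengths correspond to the 1-mer model for data group Wang and the 3-mer model for the other three data groups, and both far exceed or approach the available sample numbers.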

To avoid overfitting, the number of principal components in PLSR is determined by 10-fold cross-validation. The
promoter samples in each data group are randomly divided into 10 groups, and this random division is repeated 100 times. For each given number of
principal components, we calculate the cross-validation error of each division, and then take their average as the
cross-validation error of this component number. The error used in this study is given by $\|\log y - \log \hat{y}\|_2^2 = \|\log (y/\hat{y})\|_2^2$,
where $\hat{y}$ is the predicted strength of gene expression. The number of principal components is accepted when
the corresponding cross-validation error reaches its minimal value. Finally, the model coefficients are determined from all
promoter samples in the corresponding data group through PLSR with the previously determined principal component
number.
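The component selection procedure can be sketched as below. This is an illustration under stated assumptions, not the implementation used in the study: the text does not name a PLS algorithm or software, so a single-response NIPALS PLS is written out with NumPy, the response vector is taken to be the logarithm of the measured expression strength, and all function names and the synthetic data are hypothetical.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit single-response PLS (NIPALS); return (coefficients, intercept)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean          # centered data, deflated step by step
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xr.T @ yr
        norm = np.linalg.norm(w)
        if norm < 1e-12:                      # nothing left to explain
            break
        w = w / norm
        t = Xr @ w                            # component scores
        tt = t @ t
        p = Xr.T @ t / tt                     # X loadings
        c = yr @ t / tt                       # y loading
        Xr = Xr - np.outer(t, p)
        yr = yr - c * t
        W.append(w)
        P.append(p)
        q.append(c)
    if not W:
        return np.zeros(X.shape[1]), y_mean
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    beta = W @ np.linalg.solve(P.T @ W, q)
    return beta, y_mean - x_mean @ beta

def cv_error(X, y, n_components, n_folds=10, n_repeats=100, seed=0):
    """Average squared-error sum over repeated random 10-fold divisions."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_repeats):
        folds = np.array_split(rng.permutation(len(y)), n_folds)
        total = 0.0
        for test_idx in folds:
            train = np.ones(len(y), dtype=bool)
            train[test_idx] = False
            beta, b0 = pls1_fit(X[train], y[train], n_components)
            total += np.sum((y[test_idx] - (X[test_idx] @ beta + b0)) ** 2)
        errors.append(total)
    return float(np.mean(errors))

def select_components(X, y, max_components):
    """Return the component number with minimal cross-validation error."""
    errs = [cv_error(X, y, n) for n in range(1, max_components + 1)]
    return int(np.argmin(errs)) + 1, errs

# Synthetic, noise-free example with three hypothetical features.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
log_y = X @ np.array([1.0, -2.0, 0.5]) + 3.0
best, errs = select_components(X, log_y, max_components=3)
print(best, errs)
```

In practice X would hold the k-mer indicator variables and the final coefficients would be refitted on all samples with the selected component number, as described above.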


B. Scores corresponding to variance and range of model coefficients 

All scores of each sequence position are calculated from its related model coefficients. For the 1-mer models, the
related coefficients of position i are the coefficients of $\delta_{i,b}$ for $b \in \{-, A, T, G, C\}$. For the 2-mer models, the related
model coefficients for position $-35 < i < -1$ (or $-29 < i < -13$ for data Wang) are the coefficients of $\delta_{i-1,b,b'}$ and
$\delta_{i,b,b'}$, for $b, b' \in \{-, A, T, G, C\}$. The related coefficients for positions $i = -35$ and $i = -1$ are the coefficients of
$\delta_{-35,b,b'}$ and $\delta_{-2,b,b'}$, respectively. Similarly, for data group Wang, the related model coefficients for positions $i = -29$
and $i = -13$ are the coefficients of $\delta_{-29,b,b'}$ and $\delta_{-14,b,b'}$, respectively. Finally, for the 3-mer models, the related
model coefficients for position $-34 < i < -2$ (or $-28 < i < -14$ for data Wang) are the coefficients of $\delta_{i-2,b,b',b''}$,
$\delta_{i-1,b,b',b''}$, and $\delta_{i,b,b',b''}$. The model coefficients related to position $i = -35$ are the coefficients of $\delta_{-35,b,b',b''}$. The ones
related to position $i = -34$ are the coefficients of $\delta_{-35,b,b',b''}$ and $\delta_{-34,b,b',b''}$. The ones related to position $i = -2$ are
the coefficients of $\delta_{-4,b,b',b''}$ and $\delta_{-3,b,b',b''}$, and the ones related to position $i = -1$ are the coefficients of $\delta_{-3,b,b',b''}$.
For data group Wang, the model coefficients related to positions $i = -29, -28, -14, -13$ can be obtained similarly.
Here, $b, b', b'' \in \{-, A, T, G, C\}$. From these related model coefficients, the variance $V_k$ and range $R_k$ in the k-mer model
can be obtained as described in Sec. II.

C. Scores corresponding to F-statistic 

For the 1-mer model given in Eq. (7), if the contribution to promoter activity from the nucleotide at position
$k \in D$ is excluded, it becomes

$$\log y = \log K + \sum_{i \in D,\, i \ne k} \ \sum_{b \in \{-, A, T, G, C\}} \delta_{i,b} \left[-\Delta G_{i,b}/(k_B T)\right]. \tag{12}$$

From this modified model, a new principal component number of the optimal PLSR can be found, together with a new
optimal cross-validation error. By ranking sequence positions in descending order of these optimal
cross-validation errors, the score F can then be obtained as described in Sec. II.


D. Wilcoxon signed-rank test and Wilcoxon rank-sum test 


Suppose that, for each data group, the seven scores of each sequence position i are independent and identically
distributed random variables. Then we can use the Wilcoxon signed-rank test and the Wilcoxon rank-sum test to decide whether the
scores of a given position are significantly larger than those of another under a given significance level $\alpha$. In this study,
$\alpha = 0.05$ is used, and each pair of sequence positions is tested by these two methods. The difference between two
positions is regarded as significant if at least one of the two tests is significant. All sequence positions are then
divided into three partitions such that any position in the first partition has a significantly larger score than
any position in the third partition. See Table SVIII in [22] for the resulting partitions. Finally, the importance of each position
is analyzed from the three partitions for the four data groups as described in Sec. II.
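For seven scores per position, the signed-rank comparison of two positions can be sketched with an exact test that enumerates all sign patterns. The text does not say whether an exact or approximate p-value or a one-sided or two-sided alternative was used, so the one-sided exact form here is an assumption; the function names and example scores are hypothetical, and the rank-sum test would be handled analogously.

```python
from itertools import product

def average_ranks(values):
    """Rank values ascending (1-based), giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg = (start + end) / 2 + 1
        for j in range(start, end + 1):
            ranks[order[j]] = avg
        start = end + 1
    return ranks

def signed_rank_greater(x, y):
    """One-sided exact Wilcoxon signed-rank p-value for 'x tends to exceed y'."""
    diffs = [a - b for a, b in zip(x, y) if a != b]   # zero differences are dropped
    if not diffs:
        return 1.0
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    # Under the null hypothesis every sign pattern is equally likely.
    count = 0
    for signs in product((0, 1), repeat=len(ranks)):
        if sum(r for r, s in zip(ranks, signs) if s) >= w_plus:
            count += 1
    return count / 2 ** len(ranks)

# Hypothetical seven scores (V1-V3, R1-R3, F) of two positions in one data group.
strong = [90, 80, 70, 60, 50, 40, 30]
weak = [10, 20, 15, 5, 25, 35, 12]
print(signed_rank_greater(strong, weak))
```

With seven paired scores the smallest attainable one-sided p-value is 1/128, so a position must beat another on essentially every score to be separated at the 0.05 level, and with only four surviving scores the smallest attainable value is 1/16, above 0.05.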


E. Data clustering 

For the seven score vectors of each data group (see columns 2-8 in Tables SI-SIV [22]), data clustering is performed
by the following method. The distance between any two score vectors x and y is defined as 1 - corr(x, y), where corr(x, y)
is the correlation coefficient. First, the distance between any two score vectors is calculated. The two closest score
vectors are then clustered and replaced by their average, which is considered as a new point but with weight 2 in the
subsequent calculation of averages. This process is repeated until all score vectors are clustered together. In this study, two
score vectors are regarded as being in the same class if the distance between them is shorter than 0.55 (see Fig. S1
in [22]), in other words, if the correlation between them is larger than 0.45. Only score classes which include more than
one score are considered, and those which include only one score are excluded. Our calculations show that, for each of
the four data groups used in this study, there exists only one effective data class. For such a special case, the
above data clustering process is equivalent to an exclusion process.
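The clustering step can be sketched as follows. This is an illustration under assumptions: merging is stopped once the closest pair is at distance 0.55 or more, which is used here as a stand-in for cutting the full clustering tree at that distance, and the function names and toy score vectors are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long, non-constant vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def cluster_scores(vectors, threshold=0.55):
    """Merge the closest score vectors (distance 1 - corr) while below the threshold."""
    # Each cluster holds (member names, weighted-average vector, weight).
    clusters = [([name], list(vec), 1) for name, vec in vectors.items()]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = 1 - pearson(clusters[i][1], clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d >= threshold:
            break
        (na, va, wa), (nb, vb, wb) = clusters[i], clusters[j]
        merged = [(wa * a + wb * b) / (wa + wb) for a, b in zip(va, vb)]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((na + nb, merged, wa + wb))
    return [names for names, _, _ in clusters]

def surviving_scores(vectors, threshold=0.55):
    """Keep only scores that share a class with at least one other score."""
    classes = cluster_scores(vectors, threshold)
    return sorted(name for names in classes if len(names) > 1 for name in names)

# Toy score vectors over five positions: V1 and V2 agree, F is peculiar.
toy = {"V1": [1, 2, 3, 4, 5], "V2": [1, 2, 3, 5, 4], "F": [5, 1, 4, 2, 3]}
print(surviving_scores(toy))
```

In the toy example V1 and V2 have correlation 0.9 and are merged, while F is negatively correlated with both and is therefore excluded, which is the exclusion behaviour described above.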


IV. REMARK AND DISCUSSION 

It might seem that the largest and smallest (negative) coefficients in the various k-mer models are also reasonable
criteria for evaluating the importance of a sequence position. However, this is not the case. For example, when the model coefficients related to
one sequence position are all very large, the corresponding correlation between this position and promoter activity
may still be insignificant if their variance is very small, since in such cases the promoter activity is not sensitive to the
nucleotide type at this position. In fact, one of the main aims of this study is to find out, for the synthesis of a
promoter, at which positions the nucleotide should be chosen more carefully to achieve the needed promoter activity, and
at which positions the influence of the nucleotide type can be neglected. In other words, a strong
correlation between a sequence position and promoter activity means that, if the nucleotide at this position is not chosen properly, the promoter
activity may vary over a large range.

Besides the correlation between a single sequence position and promoter activity, we have also tried to analyze the
importance of groups of adjacent sequence positions, i.e. to find which position groups (with two or three adjacent
positions) are important for determining promoter activity, and which ones are not. However, for these more complex cases,
only the k-mer models with k = 2, 3 can be used. Therefore, there are only four or even two kinds of scores for each
data group, which are not enough to get reliable results about the position importance. The corresponding
results are therefore not shown in this study.

To analyze the importance of sequence positions in the spacing region, in the initial stage of this study we tried to
group the promoters in data groups mpl, rpl, and pilot by their -35 box and -10 box. After this pretreatment,
the nucleotide sequences in the -35 box and -10 box are the same within each group, so the sequence positions in the spacing
region can be scored easily. However, after this grouping, the sample numbers of the subgroups are usually too small
(usually less than 10) to get reliable results. Meanwhile, it is also unreliable to consider only the spacing region
positions while neglecting the differences in the -35 and -10 boxes. Therefore, in this study, all sequence positions, including
the ones in the spacing region, the -35 and -10 boxes, and the discriminator region, are scored simultaneously, though only
the normalized scores in the spacing region are finally used.

In summary, the correlation between promoter sequence positions in the spacing region, especially from position -27
to -15, and promoter activity is discussed based on three k-mer models. From the data presented in [16], we found
that position -26 in the promoter sequence is the most important one for determining promoter activity, while the data
groups mpl and rpl presented in [9] show that position -19 is the most important one, and data group pilot in [9]
shows that position -17 is the most important. These differences may be caused by the different E. coli strains used in the
experiments. To find the most important and unimportant positions that might be generally true for any E. coli strain,
three methods, WSM, FPM, and SRRSM, are used to integrate all 28 scores obtained from the four data groups.
The results suggest that positions around -19 display strong correlations with promoter activity, while the nucleotide
type at position -23 is almost irrelevant to promoter activity. Meanwhile, three modified methods are also used in
the analysis, in which scores are first clustered to exclude the peculiar ones. Similar results are obtained; see
Table I [22].
              WSM                  CWSM     FPM      CFPM     SRRSM            CSRRSM
Important     -20 -19 -18 -17 -16  -16      -20 -19  -20 -19  -20 -19 -18 -16  -20 -19 -18
Unimportant   -23 -15              -23 -15  -23 -15  -23 -15  -23              -23

TABLE I: Summary of results obtained from six methods to find correlations between sequence positions of the promoter and its
activity. The positions which have strong correlations with promoter activity are listed in the first row, and those which have weak
correlations are listed in the second row. The positions adjacent to the -10 or -35 boxes usually have strong correlations with
promoter activity. Therefore, only the positions from -27 to -15 are presented in this table.
FIG. 1: The normalized scores $V_k$, $R_k$ (for k = 1, 2, 3), and F for data group Wang presented in [16] (a), and data groups mpl,
rpl, and pilot presented in [9] (b,c,d). The thick black line is the average of $V_k$, $R_k$ and F. The x-axis is the nucleotide sequence
position in the promoter. The nucleotide hexamer from position -7 to -12 is called the -10 box, and the nucleotide hexamer from
position -30 to -35 is called the -35 box. Scores $V_k$ and $R_k$ are obtained from the variance and the range of model coefficients
in the k-mer model, respectively. Score F is obtained from the F-statistic of model coefficients in the 1-mer model. For detailed score
values, see Tables SI-SIV in [22].
FIG. 2: Surviving scores after data clustering and their average for data groups rpl (a) and pilot (b) measured in [9], with the
same legend as in Fig. 1. For data groups Wang and mpl, all scores survive data clustering. See Fig. S1 in [22] for the
results of the data clustering. (c) All scores for the four data groups and their weighted average (thick black line), with sample
numbers as the weights. (d) The 23 surviving scores after data clustering and their weighted average. In (c,d), scores for data
groups Wang, mpl, rpl, and pilot are marked by 'o', 'x', '+', and '*' respectively. For detailed values of average scores, see
Tables SV and SVI in [22].






